Effects of age and gender on using gene expression markers as prognostic markers for diffuse large B cell lymphoma (DLBCL)
MSt Healthcare Data Science
Authentication of practice
I confirm that I have fully read and understood the assignment
brief for this module.
Y
Details
Name: Chung Yan
Surname: Yu
Submission Date
19-10-25
Word Count: whole assignment including codes
Word Count: main body excluding abstract, references and
supplementary materials
University statement of originality
This assignment is the result of my own work and includes nothing
which is the outcome of work done in collaboration except as declared in
the Preface and specified in the text. It is not substantially the same
as any that I have previously submitted for a degree or diploma or other
qualification at the University of Cambridge or any other university or
similar institution, or that is being concurrently submitted, except as
declared in the Preface and specific in the text. I further state that
no substantial part of my Portfolio has already been submitted, nor is
being concurrently submitted for any such degree, diploma or other
qualification at the University of Cambridge or any other university or
similar institution except as declared in the Preface and specified in
the text.
I confirm the statement of originality as above
Y
Questions for reflection
Self-assessment is an important aspect of feedback literacy, which
is, in turn, key to the development of expertise. As you proceed through
the MSt Healthcare data science programme, we hope that you will make
use of the following prompts to assess your own work on assignments.
Specific assignment briefs will likely indicate which of these to
address for which assessments, but, in general, we expect you to respond
to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it
brief. For each question you answer, limit yourself to no more than
three items. And please remember, this is optional and developmental:
these cover sheets are designed to create space for self-assessment and
feedback dialogue, rather than additional assignment workload.
- Which aspects of this assignment are you most uncertain about and/or
would most like to receive feedback on?
- What elements are you left pondering after this assignment that you
would like to discuss further?
- How have you incorporated feedback from peers and tutors into this
assignment?
- How, and to what extent, have you been able to incorporate feedback
on previous course work into this assignment?
- Using the wording in the rubric, how would you describe the quality
of the different aspects of your work?
Declaration of the use of generative AI
Which permitted use of generative AI are you acknowledging?
Semantic search of literature and notes, outline creation, code
debugging, output formatting and generation, informed feedback
Which generative AI tool did you use (name and version)?
Claude Code v2.0.3x (the version likely changed over usage), Claude
Desktop v1.0.3x with PubMed Connector, M365 CoPilot
What did you use the tool for?
Searching for relevant journal publications, sanity check on coding
strategies, collating notes in my Obsidian vault, debugging code by
parsing error messages
How have you used or changed the generative AI’s output
E.g. my collated notes always stay in point form so that I write out
the paragraphs in my own words. Feedback provided by the GenAI models
are weighed and assessed before being acted on. Code generated are
checked, e.g. it tried to run Levene test on a model fitted using
residuals as the response, this was rectified back to the relevant
response variable.
Abstract
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin
lymphoma, with prognosis influenced by demographic features (age,
gender), integrated scoring systems such as the International Prognostic
Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged
with strong DLBCL associations, and further studies have identified over
150 genetic drivers. Yet whether demographic factors modify these
genetic markers’ prognostic effects remains understudied. This study
analysed the gene expression data of 21 genes from 775 DLBCL patients to
investigate whether age and gender modify gene expression-response
associations. Ten genes demonstrated significant age-dependent
expression levels (FDR adjusted p < 0.05), including BCL2, CCND2 and
LMO2. However, hierarchical logistic regression models revealed only
CCND2 showed significant interaction effects (Likelihood ratio test
adjusted p = 0.006) with age when it comes to association with complete
response. Older age groups showed significant increase (61-75 odds ratio
p = 0.014; >75 odds ratio p = 0.001) in complete response probability
with higher CCND2 expression, while younger patients (≤40, 41-60) showed
the opposite insignificant trend. Gender revealed no significant effects
(additive or modifying) across all genes. Only MYC gene expression level
showed significant gene-only association (likelihood ratio test adjusted
p = 0.010) with response completeness. While the modifying effects of
CCND2 and age on prognosis could reconcile certain conflicting
viewpoints in DLBCL, this dynamic requires further scrutiny and
replication elsewhere. With only one out of 42 tested gene-demographic
combination (21 genes × 2 demographic groups) showing interactive
effects, demographic modifying effects remains a rarity amongst genetic
markers. Thus a more conservative and universal approach with these
genomic markers could still be the way forward.
Introduction
Diffuse large B cell lymphoma (DLBCL) is the most common type of
aggressive non-Hodgkin’s lymphoma (NHL)1,2,
with a wide range of factors contributing to prognosis. For a comparable
and more comprehensive pre-treatment prognosis model, the international
prognostic index (IPI) emerged as a standardised model for DLBCL and
other aggressive NHLs3, combining 5 factors: age, Ann
Arbor stage classification, ECOG Performance Status, serum lactate
dehydrogenase (LDH) level and the number of extranodal sites of disease.
By pooling factors together, the IPI outperformed simple disease stage
classification models such as Ann Arbor.4 went on to improve on
the IPI, creating the National Comprehensive Cancer Network
International Prognostic Index (NCCN-IPI) as an update to the IPI. A key
innovation of the NCCN-IPI was using cubic splines to model the
continuous factors of age and LDH levels as higher resolution
categorical factors, modifying them from binary factors to quaternary
and ternary factors respectively. As prognosis models improved, so did
treatment. The first generation chemotherapy of cyclophosphamide,
doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of
30-35%5. With the addition of
rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes
curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI),
other demographic factors such as gender are not, despite established
evidence for their prognostic involvement. Males demonstrate higher
hazard ratios7 and gender-associated
pharmacokinetic differences in rituximab lead to worse treatment
responses in males8. These demographic
effects raise questions about whether risk factors perform uniformly
across patient subgroups. Beyond demographic and clinical factors, a
series of genetic alterations have emerged as strong prognostic markers:
BCL29, BCL610
and MYC11. These changes are
observed at both the genotypic and phenotypic levels, as translocation
mutations and protein overexpression respectively12. Furthermore,
patients with varying combinations of these alterations have shown worse
responses to R-CHOP treatments, highlighting the clinical relevance of
genomic profiling. Some of these genetic markers, such as BCL613 and MYC14, have shown
association with patient age, adding another layer of interaction that
could influence prognosis.13 further
demonstrated that IRF4 translocations are associated with age, and along
with BCL6 and other genetic drivers, collectively lose prognostic
significance when age is incorporated in multivariate models. This
indicates that demographic factors may not merely add to genomic risk
additively, but could modify how these genomic drivers influence
clinical outcomes.
With the advent of machine learning methods came attempts to
construct models based on clinical and genomic data.15
performed whole exome and transcriptome sequencing in 1001 DLBCL
patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC
alongside numerous newly characterised candidates. By combining
genotypic alterations, gene expression data and cell-of-origin
classification, they developed a genomic risk model that outperformed
the IPI. However, the potential for demographic factors to modify these
genomic associations remains understudied. Given that age and gender
both demonstrate prognostic significance and influence treatment
response, it remains unknown whether genomic risk markers perform
uniformly across demographic subgroups. If these demographic factors
modify how gene expression levels impact treatment response, such that
the same gene expression levels in different gender or age groups lead
to different prognostic outcomes, it would be crucial that these
differences are identified and applied in future models to prevent
inaccurate prognoses.
This study addresses this gap by investigating whether age and gender
interact and modify the prognostic associations between gene expression
markers and therapy response in DLBCL. The associations between the
demographic factors (age & gender) with complete therapy response
and the gene expression levels of 21 genes were first considered.
Logistic regression models using gene expression levels of 21 genes were
fitted with therapy response, and the interaction of demographic factors
with these gene expression levels were examined. Exploring how
demographic factors interact with known and potential genomic drivers of
DLBCL could reveal age or gender-dependent tumour biology that inform
therapeutic approaches.
Here the same datasets16 from the genomic risk
model study by15 are utilised,
specifically their clinical information dataset from Supplementary
Table S1 and gene expression dataset from Supplementary
Table S2. The clinical information dataset comprises 1001 DLBCL
patients with 35 variables. Patients are recorded with anonymised IDs
with the variables of interest to us being gender, response to initial
therapy, age at diagnosis, and the expression level of MYC, BCL2 and
BCL6 genes based on log2 transformation of RNA-sequencing Fragments per
kilobase (FPKM) value. These columns are highlighted and used in further
analysis not only due to their relevance, but also their relative
completeness. The gene expression dataset comprises the gene expression
level (log2 transformed again) of 19 genes (1 gene BCL6 is also recorded
in the clinical information dataset) for 775 patients recorded using the
same anonymised IDs.
Methods
The datasets are available via Elsevier’s open
access license, as shown on the ScienceDirect webpage
of15’s publication. For their
work, the authors obtained anonymised patient clinical information and
tumours, which were processed in line with a protocol approved by the
Institutional Review Board at Duke University. A total number of 2
datasets were downloaded as Excel files directly from their links on the
aforementioned ScienceDirect, Table
S1 and Table
S2 using the download.file function from R’s
utils package17 and read via the
readxl package18. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R20). The
processed .csv files were uploaded to this project’s GitHub
repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted21 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by4’s NCCN-IPI: \(≤40, 41 - 60, 61 - 75, >75\).
complete_response was added by combining “Partial response”
and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression levels for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available
from16’s datasets
(Supplementary Tables S1
& S2),
representing 2 categories. The first category is the “big 3” oncogenic
drivers of MYC, BCL2 and BCL6, serving as strong prognostic markers.
Moreover, MYC14 and BCL613 have both
shown age-dependent prognostic effects, suggesting age may modify how
these markers influence clinical outcomes. The second category comprises
19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22,
with BCL6 appearing in both categories. Despite their initial utility
for classification, gene members such as CCND2 and LMO223 have shown to be
strong survival predictors alongside BCL2 and BCL6. More importantly,
IRF4 is among the genetic features showing age associations and
contributes to genetic complexity that loses prognostic significance
when age is incorporated13, alongside
BCL6. Both categories therefore contain genes with potential to interact
with demographic factors, making them suitable for testing whether
demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
Characteristic
Total (N=775)
Complete Response
Incomplete Response
Sample size
N = 775
n = 598
n = 126
Gender
Female
338 (43.6%)
258 (43.1%)
54 (42.9%)
Male
437 (56.4%)
340 (56.9%)
72 (57.1%)
Age at diagnosis, years
61 ± 15.4
59.8 ± 15.7
63.1 ± 14.3
Age groups (NCCN-IPI)
≤40 years
75 (10.1%)
67 (11.5%)
8 (6.5%)
41-60 years
249 (33.5%)
202 (34.8%)
41 (33.1%)
61-75 years
285 (38.3%)
218 (37.5%)
47 (37.9%)
>75 years
135 (18.1%)
94 (16.2%)
28 (22.6%)
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
First, the variation of gene expression levels of the 21 genes in
different gender and age groups were scrutinised using the Student’s
t-test and one-way Welch’s ANOVA25 respectively, and with false
discovery rate (FDR) adjustment for the p-values26 due to the
number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were then conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, the association of therapy response completeness with gender and
age groups were studied with the Chi-squared tests with expected
frequencies checked to be \(> 5\) in
05-response-analysis.R.
Finally, to study potential interactions between demographic factors
and gene expression levels on predicting therapy response, hierarchical
logistic regression models28 were fitted to systematically
examine the effects of these factors and their interactions. The models
were nested as such:
- Null model: therapy response ~ 1
- Gene-only model: therapy response ~ gene expression
- Main effect model: therapy response ~ gene expression + demographic
factors (age or gender)
- Interaction model: therapy response ~ gene expression + demographic
factors (age or gender) + gene-demographic interactions
The fitted models were then analysed with the likelihood-ratio test
(LRT) to obtain FDR-corrected p-values. Significant models were then
checked for model fit with goodness-of-fit with Chi-square tests and
Akaike Information Criterion (AIC) scores, along with influential
observations and collinearity29 (see 06-effect-analysis.R.)
Results
Gene Expression by Demographics
Gene expression levels showed no significant difference across gender
groups after t-test with FDR correction, indicating gender does not
significantly affect baseline gene expression of these genetic markers.
Conversely, 10 genes showed significantly different gene expression
activity after Welch’s ANOVA with FDR correction across NCCN-IPI age
groups: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.010), SH3BP5
(p = 0.016), ITPKB (p = 0.020), CCND2 (p = 0.020), PTPN1 (p = 0.031),
BMF (p = 0.034), FUT8 (p = 0.037), LMO2 (p = 0.038). Post-hoc analysis
with Games-Howell revealed that age groups mostly clustered into two
statistially distinct groups for each gene, save for BLNK with three
groups Figure 132. This points to
age-dependent tumour biology at the transcription level.
Figure 1. Gene Expression Patterns Across Age
Groups.
Fig. 1
Box plots show gene expression distribution
by age group for genes with significant Welch’s ANOVA results
(FDR-adjusted p < 0.05).
Compact letter display (CLD) letters
(a, b, c) above each boxplot indicate statistical groupings from
Games-Howell post-hoc tests within each gene
-
groups with different letters differ significantly (p < 0.05).
Sample sizes (n) are shown below each age group.
Outliers are
highlighted in darker borders.
Therapy Response by Demographics
Neither gender (p = 1.000) nor age group (p = 0.173) showed
significant associations with complete response, although incomplete
response proportions did trend upwards as age increased. While neither
demographic factors showed prognostic potential on this clinical
outcome, whether they modify the gene expression-therapy response
relationship is to be explored in the next section.
Therapy Response by Gene Expression
As a part of fitting hierarchical models for each gene with either
age or gender groups, gene-only models on predicting therapy response
were also fitted (Figure 2). This revealed that
one gene out of 21, MYC, displayed significant gene-only effects (LRT
FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square goodness-of-fit
p-value = 0.947), with higher expression predicting lower complete
response probability. With the lowest AIC score among the 8 MYC models,
adding demographic or demographic-MYC intereaction did not improve model
fit, indicating MYC’s prognostic effect is consistent across age and
gender groups (Figure 3a).
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Gene and Demographic Interactions Predicting Therapy Response
Systematic testing of 21 genes across both age and gender yielded 42
gene-demographic models, of which no gender model showed significant
effect. Neither gene-gender interaction nor gender main effect models
attained significant results for any gene after FDR adjustment. For age,
only the CCND2 gene showed significant (LRT FDR-adjusted p-value =
0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects
amongst the interaction models of all 21 genes, and the rest 20 genes
showed neither main (additive) nor interactive effects with age. Having
the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models
indicate that response completeness is best predicted with gene
expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels
affecting the response completeness of the age groups differently. The
curves for age groups ≤40 and 41-60 and the curves for age groups 61-75
and >75 are going in opposite directions. Complete response
probability decreases with CCND2 expression level for the 2 younger
groups, while complete response probability increases with CCND2
expression level for the 2 older groups. Further scrutinising the odds
ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed
significantly higher complete response odds (61-75 p = 0.014; >75 p =
0.001), and younger age groups (≤40, 41-60) have trends for worse but
statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression
Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying
Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups
Orange:
Significant
Lilac: Insignificant
Confidence interval: 95%
Both
older age groups showed significant modifying effect.
Discussion
While gender had displayed potential to influence clinical
outcomes34, the NCCN-IPI did
not include gender as the authors found gender did not substantially
enhance their model’s predictive capability4. The lack of
gender-dependent gene expression patterns and gene-modifying effects on
prognosis in this analysis further confirms observation.
Of the big three oncogene drivers of DLBCL, only MYC showed
significant (gene-only) effects. MYC’s inclusion was in a sense expected
as its role as an oncogene is well established35. Whereas
increase in BCL2 expression has been associated with inferior
outcomes36 and BCL6 expression
bodes well for prognosis38. Both genes however
have significant LRT p-values when not corrected by FDR, BCL2’s
gene-only model (unadjusted p = 0.039) and BCL6’s interaction model
(unadjusted p = 0.037), so perhaps their effects were merely not
significant enough for this dataset.
Connecting the two genes of significance of this study, MYC and
CCND2, is that the oncoprotein product of MYC actually binds to the
promoter of CCND2 in vivo39, yet within this
dataset, the expression of these genes affect different age groups
differently. The revelation of CCND2’s age modifying effects might
actually provide insights to one of the more contested topics in the
field, as there are conflicting opinions and associations when it comes
to CCND2, where the activity of the gene could indicate better40 and worse41
prognoses. Obviously, more work needs to be conducted to establish this
modifying effect of age-CCND2 on prognosis is robust and not
spurious.
In the end, it has to be acknowledged that this study is not without
flaws. The available dataset, especially gene expression data, was
pre-selected for publication, not for the exploratory analysis of
demographic effects on the prognostic powers of genomic drivers. The
dataset also lacked ethnicity information, when it was likely patients
from different continents were included. It would also be of great boon
to the study to be able to benchmark the analysis here with Reddy’s
genomic risk model. Unfortunately, the (hyper-)parameters required to
rebuild such model along with the interactive webtool for this model are
no longer accessible. This further highlights reproducibility in cancer
research42 and
bioinformatics43 in general,
and make me appreciate the emphasis of robust replicability our
lecturers had imparted on us during the first module.
Conclusion
This analysis of 775 DLBCL patients discovered 10 genes BCL2, BLNK,
MYBL1, SH3BP5, ITPKB, CCND2, PTPN1, BMF, FUT8 and LMO2, showed
age-dependent gene expression levels. One gene CCND2 showed modifying
effects with age on predicting therapy outcomes, and the MYC gene showed
gene only effects with age. The discovery of CCND2’s modifying effect
could partially resolve the conflicting accounts of how CCND2 activity
associate with prognosis. This is an exciting find that still needs work
to replicate and validate.
References
1. The
Non-Hodgkin’s Lymphoma Classification Project. A clinical evaluation
of the international lymphoma study group classification of
non-hodgkin’s lymphoma. Blood 89,
3909–3918 (1997).
2. Wang, S. S. Epidemiology
and etiology of diffuse large b-cell lymphoma. Seminars in
Hematology 60, 255–266 (2023).
3. Shipp, M. A. et al. A predictive model
for aggressive non-hodgkin’s lymphoma. New England Journal of
Medicine 329, 987–994 (1993).
4. Zhou, Z. et al. An enhanced
international prognostic index (NCCN-IPI) for patients with diffuse
large b-cell lymphoma treated in the rituximab era. Blood
123, 837–842 (2014).
5. Fisher, R. I. et al. Comparison of a
standard regimen (CHOP) with three intensive chemotherapy regimens for
advanced non-hodgkin’s lymphoma. New England Journal of
Medicine 328, 1002–1006 (1993).
6. Johnson, N. A. et al. Concurrent expression of
MYC and BCL2 in diffuse large b-cell lymphoma treated with rituximab
plus cyclophosphamide, doxorubicin, vincristine, and prednisone.
Journal of Clinical Oncology 30, 3452–3459
(2012).
7. Carella, A. M. et al. Prognostic role of
gender in diffuse large b-cell lymphoma treated with rituximab
containing regimens: A fondazione italiana linfomi/grupo de estudos em
moléstias onco-hematológicas retrospective study. Leukemia &
Lymphoma 54, 53–57 (2013).
8. Habermann, T. M. Is rituximab one for
all ages and each sex? Blood 123, 602–603
(2014).
9. Gascoyne, R. D. et al. Prognostic significance
of bcl-2 protein expression and bcl-2 gene rearrangement in diffuse
aggressive non-hodgkin’s lymphoma. Blood
90, 244–251 (1997).
10. Lo
Coco, F. et al. Rearrangements of
the BCL6 gene in diffuse large cell non-hodgkin’s lymphoma.
Blood 83, 1757–1759 (1994).
11. Chenevix-Trench, G., Behm, F. G. & Westin,
E. H. Somatic
rearrangement of the c-myc oncogene in primary human diffuse large-cell
lymphoma. International Journal of Cancer
38, 513–516 (1986).
12. Petrich, A. M., Nabhan, C. & Smith, S. M.
MYC-associated and
double-hit lymphomas: A review of pathobiology, prognosis, and
therapeutic approaches. Cancer 120,
3884–3895 (2014).
13. Klapper, W. et al. Patient age at
diagnosis is associated with the molecular characteristics of diffuse
large b-cell lymphoma. Blood 119,
1882–1887 (2012).
14. Kurz, K. S. et al. Age- and gender-specific
molecular characteristics of diffuse large b-cell lymphoma: Results from
clinical trials of the DSHNHL/GLA. HemaSphere
9, e70093 (2025).
15. Reddy, A. et al. Genetic and functional
drivers of diffuse large b cell lymphoma. Cell
171, 481–494.e15 (2017).
16. Reddy, A. et al. Supplementary data
for "genetic and functional drivers of diffuse large b cell lymphoma":
Table S1 (clinical information and genetic alteration data for 1,001
DLBCL samples) and table S2 (ABC/GCB classification using gene
expression data). (2017).
17. R
Core Team. R: A Language and
Environment for Statistical Computing. (R Foundation for
Statistical Computing, Vienna, Austria, 2025).
18. Wickham, H. & Bryan, J. Readxl: Read
Excel Files. (2025). doi:10.32614/CRAN.package.readxl.
19. Wickham, H. et al. Welcome to the tidyverse. Journal of Open Source
Software 4, 1686 (2019).
20. Müller, K. Here: A Simpler Way to Find Your
Files. (2025). doi:10.32614/CRAN.package.here.
21. Xie, Y. Knitr: A General-Purpose Package for
Dynamic Report Generation in r. (2025).
22. Wright, G. et al. A gene expression-based
method to diagnose clinically distinct subgroups of diffuse large b cell
lymphoma. Proceedings of the National Academy of Sciences
100, 9991–9996 (2003).
23. Lossos, I. S. et al. Prediction of survival in
diffuse large-b-cell lymphoma based on the expression of six genes.
New England Journal of Medicine 350, 1828–1837
(2004).
24. Zhu, H. kableExtra: Construct Complex Table
with ’Kable’ and Pipe Syntax. (2024). doi:10.32614/CRAN.package.kableExtra.
25. Kassambara, A. Rstatix: Pipe-Friendly
Framework for Basic Statistical Tests. (2025). doi:10.32614/CRAN.package.rstatix.
26. Reiner, A., Yekutieli, D. & Benjamini, Y.
Identifying
differentially expressed genes using false discovery rate controlling
procedures. Bioinformatics 19, 368–375
(2003).
27. Rongen, M. van et al. Core statisics.
(2025).
28. Robinson, D., Hayes, A. & Couch, S.
Broom: Convert Statistical Objects into Tidy Tibbles. (2025).
doi:10.32614/CRAN.package.broom.
29. Hodgson, V., Castle, M., Nicholls, R. &
Rongen, M. van. Generalised linear
models. (2025).
30. Sievert, C. Interactive Web-Based Data Visualization
with r, Plotly, and Shiny. (Chapman; Hall/CRC, 2020).
31. Graves, S., Piepho, H.-P. & Sundar
Dorai-Raj, L. S. with help from. multcompView: Visualizations of
Paired Comparisons. (2024). doi:10.32614/CRAN.package.multcompView.
32. Neuwirth, E. RColorBrewer: ColorBrewer
Palettes. (2022). doi:10.32614/CRAN.package.RColorBrewer.
33. Lenth, R. V. & Piaskowski, J. Emmeans:
Estimated Marginal Means, Aka Least-Squares Means. (2025). doi:10.32614/CRAN.package.emmeans.
34. Hasselblom, S., Ridell, B., Nilsson-Ehle, H.
& Andersson, P.-O. The impact of gender,
age and patient selection on prognosis and outcome in diffuse large
b-cell lymphoma—a population-based study. Leukemia &
Lymphoma 48, 736–745 (2007).
35. Dhanasekaran, R. et al. The MYC oncogene — the
grand orchestrator of cancer growth and immune evasion. Nature
Reviews Clinical Oncology 19, 23–36 (2022).
36. Tsuyama, N. et al. BCL2 expression in
DLBCL: Reappraisal of immunohistochemistry with new criteria for
therapeutic biomarker evaluation. Blood
130, 489–500 (2017).
37. Hans, C. P. et al. Confirmation of the
molecular classification of diffuse large b-cell lymphoma by
immunohistochemistry using a tissue microarray. Blood
103, 275–282 (2004).
38. Advani, P. et al. The expression of
bcl-6 in diffuse large b cell lymphoma (DLBCL) is associated with
improved progression free survival and overall survival following
front-line rituximab-chemotherapy. Blood
116, 5073–5073 (2010).
39. Bouchard, C. et al. Regulation of cyclin D2 gene
expression by the myc/max/mad network: Myc-dependent TRRAP recruitment
and histone acetylation at the cyclin D2 promoter. Genes &
Development 15, 2042–2047 (2001).
40. Wang, D., Zhang, Y. & Che, Y.-Q. CCND2 mRNA expression is
correlated with r-CHOP treatment efficacy and prognosis in patients with
ABC-DLBCL. Frontiers in Oncology Volume 10 -
2020, (2020).
41. Yi,
L. et al. Palbociclib enhances
the effect of doxorubicin-induced apoptosis in activated b-cell–like
diffuse large b-cell lymphoma cells. Anti-Cancer Drugs
34, 257–268 (2023).
42. Begley, C. G. & Ellis, L. M. Raise standards for preclinical
cancer research. Nature 483, 531–533
(2012).
43. Ziemann, M., Poulain, P. & Bora, A. The five pillars of
computational reproducibility: Bioinformatics and beyond.
Briefings in Bioinformatics 24, bbad375
(2023).
Authentication of practice
| I confirm that I have fully read and understood the assignment brief for this module. | Y |
Details
| Name: Chung Yan | Surname: Yu |
| Submission Date | 19-10-25 |
| Word Count: whole assignment including codes | |
| Word Count: main body excluding abstract, references and supplementary materials |
University statement of originality
This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specific in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.
| I confirm the statement of originality as above | Y |
Questions for reflection
Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.
For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.
- Which aspects of this assignment are you most uncertain about and/or would most like to receive feedback on?
- What elements are you left pondering after this assignment that you would like to discuss further?
- How have you incorporated feedback from peers and tutors into this assignment?
- How, and to what extent, have you been able to incorporate feedback on previous course work into this assignment?
- Using the wording in the rubric, how would you describe the quality of the different aspects of your work?
Declaration of the use of generative AI
| Which permitted use of generative AI are you acknowledging? | Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback |
| Which generative AI tool did you use (name and version)? | Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot |
| What did you use the tool for? | Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages |
| How have you used or changed the generative AI’s output | E.g. my collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models are weighed and assessed before being acted on. Code generated are checked, e.g. it tried to run Levene test on a model fitted using residuals as the response, this was rectified back to the relevant response variable. |
Abstract
Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by demographic features (age, gender), integrated scoring systems such as the International Prognostic Index (IPI) and genetic alterations. Oncogenes MYC, BCL2, BCL6 emerged with strong DLBCL associations, and further studies have identified over 150 genetic drivers. Yet whether demographic factors modify these genetic markers’ prognostic effects remains understudied. This study analysed the gene expression data of 21 genes from 775 DLBCL patients to investigate whether age and gender modify gene expression-response associations. Ten genes demonstrated significant age-dependent expression levels (FDR adjusted p < 0.05), including BCL2, CCND2 and LMO2. However, hierarchical logistic regression models revealed only CCND2 showed significant interaction effects (Likelihood ratio test adjusted p = 0.006) with age when it comes to association with complete response. Older age groups showed significant increase (61-75 odds ratio p = 0.014; >75 odds ratio p = 0.001) in complete response probability with higher CCND2 expression, while younger patients (≤40, 41-60) showed the opposite insignificant trend. Gender revealed no significant effects (additive or modifying) across all genes. Only MYC gene expression level showed significant gene-only association (likelihood ratio test adjusted p = 0.010) with response completeness. While the modifying effects of CCND2 and age on prognosis could reconcile certain conflicting viewpoints in DLBCL, this dynamic requires further scrutiny and replication elsewhere. With only one out of 42 tested gene-demographic combination (21 genes × 2 demographic groups) showing interactive effects, demographic modifying effects remains a rarity amongst genetic markers. Thus a more conservative and universal approach with these genomic markers could still be the way forward.
Introduction
Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL)1,2, with a wide range of factors contributing to prognosis. For a comparable and more comprehensive pre-treatment prognosis model, the international prognostic index (IPI) emerged as a standardised model for DLBCL and other aggressive NHLs3, combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling factors together, the IPI outperformed simple disease stage classification models such as Ann Arbor.4 went on to improve on the IPI, creating the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI) as an update to the IPI. A key innovation of the NCCN-IPI was using cubic splines to model the continuous factors of age and LDH levels as higher resolution categorical factors, modifying them from binary factors to quaternary and ternary factors respectively. As prognosis models improved, so did treatment. The first generation chemotherapy of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of 30-35%5. With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL becomes curable for more than 60% of patients6.
While age is incorporated into the IPI (and subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios7 and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males8. These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL29, BCL610 and MYC11. These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively12. Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatments, highlighting the clinical relevance of genomic profiling. Some of these genetic markers, such as BCL613 and MYC14, have shown association with patient age, adding another layer of interaction that could influence prognosis.13 further demonstrated that IRF4 translocations are associated with age, and along with BCL6 and other genetic drivers, collectively lose prognostic significance when age is incorporated in multivariate models. This indicates that demographic factors may not merely add to genomic risk additively, but could modify how these genomic drivers influence clinical outcomes.
With the advent of machine learning methods came attempts to construct models based on clinical and genomic data.15 performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.
This study addresses this gap by investigating whether age and gender interact and modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations between the demographic factors (age & gender) with complete therapy response and the gene expression levels of 21 genes were first considered. Logistic regression models using gene expression levels of 21 genes were fitted with therapy response, and the interaction of demographic factors with these gene expression levels were examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age or gender-dependent tumour biology that inform therapeutic approaches.
Here the same datasets16 from the genomic risk model study by15 are utilised, specifically their clinical information dataset from Supplementary Table S1 and gene expression dataset from Supplementary Table S2. The clinical information dataset comprises 1001 DLBCL patients with 35 variables. Patients are recorded with anonymised IDs with the variables of interest to us being gender, response to initial therapy, age at diagnosis, and the expression level of MYC, BCL2 and BCL6 genes based on log2 transformation of RNA-sequencing Fragments per kilobase (FPKM) value. These columns are highlighted and used in further analysis not only due to their relevance, but also their relative completeness. The gene expression dataset comprises the gene expression level (log2 transformed again) of 19 genes (1 gene BCL6 is also recorded in the clinical information dataset) for 775 patients recorded using the same anonymised IDs.
Methods
The datasets are available via Elsevier’s open
access license, as shown on the ScienceDirect webpage
of15’s publication. For their
work, the authors obtained anonymised patient clinical information and
tumours, which were processed in line with a protocol approved by the
Institutional Review Board at Duke University. A total number of 2
datasets were downloaded as Excel files directly from their links on the
aforementioned ScienceDirect, Table
S1 and Table
S2 using the download.file function from R’s
utils package17 and read via the
readxl package18. The specific sheets were
selected then cleaned to be exported as .csv files (see 01-data-preprocessing.R20). The
processed .csv files were uploaded to this project’s GitHub
repository: mmc1-ClinicalInformation.csv
(raw), mmc2-GeneExpression.csv
(raw). This RMarkdown document is self-sufficient and contains code
for the relevant analysis and plots shown only, for a complete analysis
including assumption testing, post-hoc analysis, tests that showed
insignificant results, please refer to the supplementary-materials
folder of this repository. This document also supports sourcing the
datasets from 3 sources for redundancy: original ScienceDirect links,
processed .csv files on GitHub and local processed
.csv files when this repository is downloaded. For this
knitted21 HTML document, both datasets were
loaded directly from ScienceDirect’s supplementary materials: S1
Table (clinical data) and S2
Table (gene expression data).
The 2 datasets were joined using common patient IDs and 2 new columns
were added. age_group_nccn was obtained by transforming the
age at diagnosis variable into 4 categorical groups according to cutoffs
defined by4’s NCCN-IPI: \(≤40, 41 - 60, 61 - 75, >75\).
complete_response was added by combining “Partial response”
and “No response” as “Incomplete response” from
response_therapy while keeping “Complete response”
unchanged. This prepared the dataset for hierarchical binomial logistic
regression. As for duplicate gene expression levels for BCL6, the two
columns were compared to verify identicality and 1 was discarded. The
resulting table was one with 775 patients and 27 variables, of which
include the anonymised patient ID; demographic variables such as gender
(binary: F/M), age (continuous, at diagnosis) and age group (quaternary,
as defined by NCCN-IPI); clinical information of therapy response
(ternary: Complete response, Partial response, No response); and the
log2 transformed gene expression levels of 21 genes: MYC, BCL2, BCL6,
ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1,
BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
Gene expression data were limited to these 21 genes available from16’s datasets (Supplementary Tables S1 & S2), representing 2 categories. The first category is the “big 3” oncogenic drivers of MYC, BCL2 and BCL6, serving as strong prognostic markers. Moreover, MYC14 and BCL613 have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB)22, with BCL6 appearing in both categories. Despite their initial utility for classification, gene members such as CCND2 and LMO223 have shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4 is among the genetic features showing age associations and contributes to genetic complexity that loses prognostic significance when age is incorporated13, alongside BCL6. Both categories therefore contain genes with potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.
Overall, data completeness for these 27 columns was exceptional, with
23 columns being 100%, and the remaining 4 columns all with more than
93% completeness. Processing is documented in 02-data-processing.R.
Characteristics of the cohort are summarised24 in
Table 1.
Table 1. Baseline Cohort Characteristics
| Characteristic | Total (N=775) | Complete Response | Incomplete Response |
|---|---|---|---|
| Sample size | N = 775 | n = 598 | n = 126 |
| Gender | |||
| Female | 338 (43.6%) | 258 (43.1%) | 54 (42.9%) |
| Male | 437 (56.4%) | 340 (56.9%) | 72 (57.1%) |
| Age at diagnosis, years | 61 ± 15.4 | 59.8 ± 15.7 | 63.1 ± 14.3 |
| Age groups (NCCN-IPI) | |||
| ≤40 years | 75 (10.1%) | 67 (11.5%) | 8 (6.5%) |
| 41-60 years | 249 (33.5%) | 202 (34.8%) | 41 (33.1%) |
| 61-75 years | 285 (38.3%) | 218 (37.5%) | 47 (37.9%) |
| >75 years | 135 (18.1%) | 94 (16.2%) | 28 (22.6%) |
Demographic and clinical characteristics stratified by treatment
response.
Values shown as n (%) for categorical variables and mean
± SD for continuous variables.
Treatment response categories
combine partial and no response as “Incomplete response” versus complete
response.
NCCN-IPI age groups represent National Comprehensive
Cancer Network International Prognostic Index classifications.
Percentages are calculated within each response group for gender and age
group distributions.
First, the variation of gene expression levels of the 21 genes in
different gender and age groups were scrutinised using the Student’s
t-test and one-way Welch’s ANOVA25 respectively, and with false
discovery rate (FDR) adjustment for the p-values26 due to the
number of genes analysed (see 04-data-analysis.R).
Post-hoc Games-Howell tests were then conducted for significant ANOVA
results27. Assumptions for these tests
were checked with diagnostic tests and plots, and the overall
distribution of gene expression against age and gender visually examined
(see 03-data-exploration.R).
Second, the association of therapy response completeness with gender and
age groups were studied with the Chi-squared tests with expected
frequencies checked to be \(> 5\) in
05-response-analysis.R.
Finally, to study potential interactions between demographic factors and gene expression levels on predicting therapy response, hierarchical logistic regression models28 were fitted to systematically examine the effects of these factors and their interactions. The models were nested as such:
- Null model: therapy response ~ 1
- Gene-only model: therapy response ~ gene expression
- Main effect model: therapy response ~ gene expression + demographic factors (age or gender)
- Interaction model: therapy response ~ gene expression + demographic factors (age or gender) + gene-demographic interactions
The fitted models were then analysed with the likelihood-ratio test
(LRT) to obtain FDR-corrected p-values. Significant models were then
checked for model fit with goodness-of-fit with Chi-square tests and
Akaike Information Criterion (AIC) scores, along with influential
observations and collinearity29 (see 06-effect-analysis.R.)
Results
Gene Expression by Demographics
Gene expression levels showed no significant difference across gender groups after t-test with FDR correction, indicating gender does not significantly affect baseline gene expression of these genetic markers. Conversely, 10 genes showed significantly different gene expression activity after Welch’s ANOVA with FDR correction across NCCN-IPI age groups: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.010), SH3BP5 (p = 0.016), ITPKB (p = 0.020), CCND2 (p = 0.020), PTPN1 (p = 0.031), BMF (p = 0.034), FUT8 (p = 0.037), LMO2 (p = 0.038). Post-hoc analysis with Games-Howell revealed that age groups mostly clustered into two statistially distinct groups for each gene, save for BLNK with three groups Figure 132. This points to age-dependent tumour biology at the transcription level.
Figure 1. Gene Expression Patterns Across Age Groups.
Box plots show gene expression distribution by age group for genes with significant Welch’s ANOVA results (FDR-adjusted p < 0.05).
Compact letter display (CLD) letters (a, b, c) above each boxplot indicate statistical groupings from Games-Howell post-hoc tests within each gene
- groups with different letters differ significantly (p < 0.05).
Sample sizes (n) are shown below each age group.
Outliers are
highlighted in darker borders.
Therapy Response by Demographics
Neither gender (p = 1.000) nor age group (p = 0.173) showed significant associations with complete response, although incomplete response proportions did trend upwards as age increased. While neither demographic factors showed prognostic potential on this clinical outcome, whether they modify the gene expression-therapy response relationship is to be explored in the next section.
Therapy Response by Gene Expression
As a part of fitting hierarchical models for each gene with either age or gender groups, gene-only models on predicting therapy response were also fitted (Figure 2). This revealed that one gene out of 21, MYC, displayed significant gene-only effects (LRT FDR-adjusted p-value = 0.010, AIC = 647.57, Chi-square goodness-of-fit p-value = 0.947), with higher expression predicting lower complete response probability. With the lowest AIC score among the 8 MYC models, adding demographic or demographic-MYC intereaction did not improve model fit, indicating MYC’s prognostic effect is consistent across age and gender groups (Figure 3a).
Figure 2. Gene-only Effects on Complete Response
Fig. 2
Orange: Adjusted significant
Blue:
Adjusted insignificant
Lilac: Insignificant
Confidence interval:
95%
MYC is the only gene showing significant gene effect after FDR
correction, being well clear of the 1.0 reference (no effect) reference
line even when considering its CI error bars.
Gene and Demographic Interactions Predicting Therapy Response
Systematic testing of 21 genes across both age and gender yielded 42 gene-demographic models, of which no gender model showed significant effect. Neither gene-gender interaction nor gender main effect models attained significant results for any gene after FDR adjustment. For age, only the CCND2 gene showed significant (LRT FDR-adjusted p-value = 0.006, AIC = 643.12, Chi-square goodness-of-fit p-value = 0.973) effects amongst the interaction models of all 21 genes, and the rest 20 genes showed neither main (additive) nor interactive effects with age. Having the lowest AIC score (min delta = 12.7) amongst the 4 CCND2 models indicate that response completeness is best predicted with gene expression when its interaction with age is factored in.
Clear age-dependent directional effects emerge for CCND2 (Figure 3b), with higher CCND2 expression levels affecting the response completeness of the age groups differently. The curves for age groups ≤40 and 41-60 and the curves for age groups 61-75 and >75 are going in opposite directions. Complete response probability decreases with CCND2 expression level for the 2 younger groups, while complete response probability increases with CCND2 expression level for the 2 older groups. Further scrutinising the odds ratio33 of the 4 age groups (Figure 4) confirms that both older age groups showed significantly higher complete response odds (61-75 p = 0.014; >75 p = 0.001), and younger age groups (≤40, 41-60) have trends for worse but statistically insignificant odds.
Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response
Fig. 3
A. Logistic regression
plot of MYC gene expression against complete response probability,
age-interaction model
B. Logistic regression plot
of CCND2 gene expression against complete response probability,
age-interaction model
The MYC gene showed no significant
age-interaction effects, contrasting the curves of the different age
groups for CCND2.
Figure 4. CCND2 Interaction Effects - Age Group Modifying Gene Expression Effects on Complete Response
Fig. 4
Odds ratio of age groups
Orange:
Significant
Lilac: Insignificant
Confidence interval: 95%
Both
older age groups showed significant modifying effect.
Discussion
While gender had displayed potential to influence clinical outcomes34, the NCCN-IPI did not include gender as the authors found gender did not substantially enhance their model’s predictive capability4. The lack of gender-dependent gene expression patterns and gene-modifying effects on prognosis in this analysis further confirms observation.
Of the big three oncogene drivers of DLBCL, only MYC showed significant (gene-only) effects. MYC’s inclusion was in a sense expected as its role as an oncogene is well established35. Whereas increase in BCL2 expression has been associated with inferior outcomes36 and BCL6 expression bodes well for prognosis38. Both genes however have significant LRT p-values when not corrected by FDR, BCL2’s gene-only model (unadjusted p = 0.039) and BCL6’s interaction model (unadjusted p = 0.037), so perhaps their effects were merely not significant enough for this dataset.
Connecting the two genes of significance of this study, MYC and CCND2, is that the oncoprotein product of MYC actually binds to the promoter of CCND2 in vivo39, yet within this dataset, the expression of these genes affect different age groups differently. The revelation of CCND2’s age modifying effects might actually provide insights to one of the more contested topics in the field, as there are conflicting opinions and associations when it comes to CCND2, where the activity of the gene could indicate better40 and worse41 prognoses. Obviously, more work needs to be conducted to establish this modifying effect of age-CCND2 on prognosis is robust and not spurious.
In the end, it has to be acknowledged that this study is not without flaws. The available dataset, especially gene expression data, was pre-selected for publication, not for the exploratory analysis of demographic effects on the prognostic powers of genomic drivers. The dataset also lacked ethnicity information, when it was likely patients from different continents were included. It would also be of great boon to the study to be able to benchmark the analysis here with Reddy’s genomic risk model. Unfortunately, the (hyper-)parameters required to rebuild such model along with the interactive webtool for this model are no longer accessible. This further highlights reproducibility in cancer research42 and bioinformatics43 in general, and make me appreciate the emphasis of robust replicability our lecturers had imparted on us during the first module.
Conclusion
This analysis of 775 DLBCL patients discovered 10 genes BCL2, BLNK, MYBL1, SH3BP5, ITPKB, CCND2, PTPN1, BMF, FUT8 and LMO2, showed age-dependent gene expression levels. One gene CCND2 showed modifying effects with age on predicting therapy outcomes, and the MYC gene showed gene only effects with age. The discovery of CCND2’s modifying effect could partially resolve the conflicting accounts of how CCND2 activity associate with prognosis. This is an exciting find that still needs work to replicate and validate.